class: center, middle, inverse, title-slide .title[ # Survey Data Analysis with Kobocruncher ] .subtitle[ ## Session 6 - Cleaning and Indicator Calculation ] .author[ ###
Link to Documentation
–
Link to Previous Session
–
Link to Next Session
] .date[ ### Training Content as of 08 December 2022 ]

---

## When do you need to clean the data?

Survey data cleaning involves identifying and removing responses from individuals who either do not match your target audience criteria or did not answer your questions thoughtfully. This filtering is done to avoid drawing misleading conclusions.

Data cleaning remains a last-resort option whose need can be minimized upfront by:

* .large[__Quality questionnaire design__], not only to minimize social desirability and biased questions but also to keep the interview duration limited (_ideally less than 45 minutes for a face-to-face interview and less than 25 minutes for a telephone interview_)

* .large[__Good form encoding__], with well-defined [constraints](https://xlsform.org/en/#constraints), [skip logic](https://xlsform.org/en/#relevant) and [requirements](https://xlsform.org/en/#required) to avoid inconsistent responses, and sufficient testing to ensure that the questions are well understood and the response options cover the range of likely answers

* .large[__Good training for the enumerators__] and detailed [question hints](https://xlsform.org/en/#hints) so that enumerators fill in the questionnaire correctly

* .large[__Sufficient data collection quality monitoring__] to identify, prevent and cure issues early on. This can be done through [High Frequency Checks](https://unhcr.github.io/HighFrequencyChecks/docs/), which help flag straightlining / patterned responses, for instance when an enumerator records the same answer option ("B") over and over (as an example, for at least five rows in a grid)...

.bg-blue[ For data quality, prevention is a lot more effective, quicker and cheaper than curing. Take the time to thoroughly test the questionnaire before starting full-on data collection. ]

???
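A minimal base-R sketch of the straightlining check mentioned above (not part of kobocruncher; variable names are hypothetical): flag interviews where the same answer code repeats at least `min_repeat` times in a row across a block of related grid questions.

```r
## Sketch: flag potential straightlining in a block of grid questions.
## `grid_vars` lists the (hypothetical) column names of the grid.
flag_straightlining <- function(data, grid_vars, min_repeat = 5) {
  unname(apply(data[, grid_vars], 1, function(row) {
    runs <- rle(as.character(row))   # lengths of consecutive identical answers
    any(runs$lengths >= min_repeat)
  }))
}

## toy data: respondent 1 answers "B" to all six grid questions
toy <- data.frame(q1 = c("B", "A"), q2 = c("B", "C"), q3 = c("B", "A"),
                  q4 = c("B", "B"), q5 = c("B", "C"), q6 = c("B", "A"))
flag_straightlining(toy, paste0("q", 1:6))  # respondent 1 flagged, respondent 2 not
```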
https://dimewiki.worldbank.org/Checklist:_Data_Cleaning
https://dimewiki.worldbank.org/Data_Cleaning

---

## Cleaning is the most time-consuming task: go through your initial exploration report to identify issues!

To guide this selection phase, data experts, in collaboration with the data analysis group, can use the following elements:

* For numeric variables, check the .large[__frequency distributions__] of each variable: average, deviation, outliers and other oddities

* For categorical variables, check for .large[__unexpected values__]: any odd results based on common-sense expectations

* Use cross-tabulation to verify potential .large[__illogical combinations__] of answers (for instance "pregnant men"... )

* Use correlation analysis (chi-square) to check for potential .large[__contradictions__] between respondents' answers to different questions with identified associations

* Always check for .large[__missing data__] (NA) or a "% of respondents who answered" (in the chart caption) that you cannot confidently explain

* Check unanswered questions that correspond to .large[__unused skip logic__] in the questionnaire: for instance, did a person who was never displaced answer displacement-related questions? Were employment-related answers provided for a toddler?

---

## Variables cleaning

Cleaning involves cleaning both records (rows) and variables (columns). Cleaning variables is relevant in a few situations:

* System variables (precise dates) and section timestamps

* Removing .large[__direct identifiers__] such as:

   * Name and surname

   * Document number (passport, national ID, driving license, etc.)

   * Address or other precise geographic information

   * GPS coordinates, if collected during face-to-face interviews. In some situations, it is possible to at least decrease the accuracy of the coordinates (by removing digits)....
   * Telephone number

.large[Fix them] by setting the concerned variable as `identifier` in the `anonymise` column of the `xlsform`: see session [08-Anonymising](08-Anonymising.html)

---

## Situations where you will still need minimum record cleaning a priori

Whatever the quality of the form design, enumerator training and data collection quality monitoring, there will still be cases where cleaning involves removing entire records:

* Remove from the dataset records where no consent was obtained and, more broadly, records matching a specific filter/condition (the respondent does not meet certain criteria, or the data comes from an unreliable enumerator identified during data collection quality monitoring...)

* Remove duplicate respondent IDs based on the original sample list

* Surveys often include nested tables (aka `repeat`), so if you remove records from the main table, you also need to remove the linked records in the nested tables

`kobo_remove` will remove records based on a specific filter. For instance, to remove all the records from the enumerator "edouard" (assuming there is a variable called `enumerator` where the enumerator name is recorded):

> kobo_remove( datalist = datalist, filter( datalist[["main"]]$enumerator == "edouard"))

---

## Clean based on time

* Remove from the dataset records before the start or after the end of the data collection period

.large[Fix them] by setting, within the `xlsform`:

   * the starting date of the data collection in the `clean` column / `start` row

   * the ending date of the data collection in the `clean` column / `end` row

* Remove from the dataset records whose interview duration appears as an outlier, either too long or too short (aka "speed responses")

---

## Situations where you will still need minimum cleaning a posteriori

In other cases, cleaning will involve .large[`recoding`] some variables:

1. Recode un-explainable .large[__outliers for numerical questions__].
An example of this would be if you asked how much water one person uses in a day and someone answered 1,000 litres, while the second-largest usage reported is 150 litres.

2. Recode the follow-up questions triggered by .large[__"or other" choices__].

3. Recode some question answers as .large[__new calculated variables__] to get more balanced response categories, based on frequency or close meaning

---

## Outliers for numerical questions

* Outliers: values significantly different from the others

* Outliers should be removed or modified only if they are (clearly) wrong values

* A common outlier definition: observations more than three standard deviations from the mean

.large[Identify outliers] by looking at the histograms, boxplots and scatter plots from the exploration report

.large[Fix them] by setting the maximum accepted standard deviation for the variable in the `clean` column of the `xlsform`

---

## Recode categories and/or treat "or other" choices

Often, categorical questions include an `or other` option, and this option might be misused. In the exploratory report, those responses are plotted as a word cloud.

.large[Fix them] by setting the `clean` column of the `xlsform`. For instance, the question `shelter` includes an option `other` which triggers a subsequent text question `shelter_other`

.pull-left[

* If you want to clean the badly categorized `other` responses from `shelter`, on the row `shelter_other` in the column `clean`, insert

> shelter == "other"

.bg-blue[ This will automatically create an additional cleaning log worksheet in the xlsform, called `clean_shelter_other`, where the first column is the original value and the second column is the cleaned value.
You can then manually edit the second column, and the next time you run `kobo_clean`, __the 'other' values__ will be automatically replaced by the values from your cleaning log whenever they differ from the original values ]

]

.pull-right[

* If you want to adjust the current categories of `shelter`, on the row `shelter` in the column `clean`, insert

> shelter

.bg-blue[ This will automatically create an additional cleaning log worksheet in the xlsform, called `clean_shelter`, where the first column is the original value and the second column is the cleaned value.

You can then manually edit the second column, and the next time you run `kobo_clean`, __all values__ will be automatically replaced by the values from your cleaning log whenever they differ from the original values ]

]

---

## Use cases for new calculated variables

* Filter on specific criteria

* Ratio between 2 numeric variables

* Calculation on dates

* Discretization of a numeric variable according to quintiles

* Discretization of a numeric variable according to fixed breaks

* Aggregate a variable from a nested frame (aka within a repeat) to the parent table

---

## kobo_indicator

In kobocruncher, this is done with the [kobo_indicator](https://edouard-legoupil.github.io/kobocruncher/reference/kobo_indicator.html) function. The function goes through the following steps:

1. load the indicators already defined in the xlsform

2. append the new indicators supplied to the function, if any

3. apply the indicators, i.e. do the calculations

4. re-save all the working indicator definitions within the extended xlsform, in a dedicated indicator worksheet

5. bind the new indicators into the dictionary for further plotting

6. rebuild the plan if indicators are allocated to a chapter or subchapter

---

## Create a new variable based on a combination of specific criteria

.pull-left[

When adding a new indicator, a few elements should be provided:

* Name and label for the new variable: ideally, the name should be concise and meaningful (less than 12 characters) and, as for any regular label, the label should be less than 80 characters

* Type: a calculated variable will be either of type `select_one` or `numeric`. Documenting the indicator type will allow the indicator to be crunched automatically

* list_name and list_label are the labels for the response options in case the indicator is of type `select_one`. This will allow automatic relabelling for charting

* repeatvar: you need to document on which frame the new indicator will be calculated. For most household surveys, this will be either the household (aka the first frame, named main by default and referenced as `datalist[["main"]]`) or the frame for individuals (for instance, if, within the repeat of the xlsform, the frame was named members, `datalist[["members"]]`)

* Calculation: this is the complex part. The calculation should be defined as an R statement using data manipulation functions. To build the statement, you will need to identify the correct variable and response names.
]

.pull-right[

```r
indicatoradd <- c( 
  name = "inColombia",
  label = "Is from Colombia",
  type = "select_one",
  repeatvar = "datalist[[\"main\"]]",
  calculation = "dplyr::if_else(datalist[[\"main\"]]$variable ==\"criteria\", \"yes\",\"no\")" )

## then we add our indicators and expand
expanded <- kobo_indicator(datalist = datalist,
                           dico = dico,
                           indicatoradd = indicatoradd,
                           xlsformpath = xlsformpath,
                           xlsformpathout = xlsformpathout)
dico <- expanded[["dico"]]
datalist <- expanded[["datalist"]]
```

]

---

## The key data manipulation verbs for indicator calculations

* `dplyr::mutate()` creates a new variable

* `dplyr::filter()` extracts rows that meet logical criteria, for instance when you want to extract the information about the head of household from the nested frame and append it to the household information

* `dplyr::if_else()` & `dplyr::case_when()` apply a single condition (if) or multiple conditions (case) in order to create a calculated variable

* `dplyr::group_by()`, `dplyr::summarise()` & `dplyr::ungroup()` perform aggregations and calculations (count, sum) over specific groups

* `dplyr::left_join()` merges information from 2 frames, for instance information available at household level on one side and at individual level on the other

???
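A minimal sketch of how these verbs combine, using hypothetical `main` and `members` frames (plain dplyr, not the kobocruncher API): derive an age group with `mutate()` + `case_when()`, aggregate the nested frame with `group_by()`/`summarise()`, then attach the result to the parent frame with `left_join()`.

```r
library(dplyr)

## hypothetical parent (household) and nested (individual) frames
main    <- data.frame(parent_index = 1:2, region = c("north", "south"))
members <- data.frame(parent_index = c(1, 1, 2), age = c(4, 34, 70))

## mutate() + case_when(): calculated variable on the nested frame
members <- members |>
  mutate(age_group = case_when(age < 18 ~ "child",
                               age < 60 ~ "adult",
                               TRUE     ~ "older person"))

## group_by() + summarise(): aggregate individuals per household
per_hh <- members |>
  group_by(parent_index) |>
  summarise(hh_size = n(), n_children = sum(age_group == "child")) |>
  ungroup()

## left_join(): bring the aggregates back to the parent frame
main <- main |> left_join(per_hh, by = "parent_index")
main  # household 1: 2 members, 1 child; household 2: 1 member, 0 children
```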
https://posit.co/wp-content/uploads/2022/10/data-transformation-1.pdf
https://posit.co/wp-content/uploads/2022/10/tidyr.pdf

---

## Ratio between 2 numeric variables

.pull-left[ ]

.pull-right[

```r
indicatoradd <- c( 
  name = "ratio",
  label = "Ratio varnum1 on varnum2",
  type = "numeric",
  repeatvar = "datalist[[\"main\"]]",
  calculation = "datalist[[\"main\"]]$varnum1 / datalist[[\"main\"]]$varnum2" )
```

]

---

## Calculation on dates

.pull-left[ ]

.pull-right[

```r
indicatoradd <- c( 
  name = "duration",
  label = "Difference between today and datetocheck",
  type = "numeric",
  repeatvar = "datalist[[\"main\"]]",
  calculation = "lubridate::interval( datalist[[\"main\"]]$datetocheck, lubridate::today()) %/% months(1)" )
```

]

---

## Discretization of a numeric variable according to quintiles

.pull-left[ ]

.pull-right[

```r
indicatoradd <- c( 
  name = "varnum_cat",
  label = "Discretise the varnum into quintiles",
  type = "select_one",
  repeatvar = "datalist[[\"main\"]]",
  calculation = "Hmisc::cut2(datalist[[\"main\"]]$varnum, g = 5)" )
```

]

---

## Discretization of a numeric variable according to fixed breaks

.pull-left[ for instance, converting case size from integer to categorical ]

.pull-right[

```r
indicatoradd <- c( 
  name = "varnum_cat",
  label = "Discretise the varnum into fixed breaks",
  type = "select_one",
  repeatvar = "datalist[[\"main\"]]",
  calculation = "cut(datalist[[\"main\"]]$casesize,
                     breaks = c(0, 1, 2, 3, 5, 30),
                     labels = c(\"Case.size.1\", \"Case.size.2\", \"Case.size.3\",
                                \"Case.size.4.5\", \"Case.size.6.or.more\"),
                     include.lowest = TRUE)" )
```

]

---

## Aggregate a variable from a nested frame (aka within a repeat) to the parent table

.pull-left[ ]

.pull-right[

```r
indicatoradd <- c( 
  name = "number_female_HH",
  label = "Number of females in the household",
  hint = "this indicator counts the number of females as registered in the household roster",
  type = "integer",
  repeatvar = "datalist[[\"main\"]]",
  calculation = "datalist[[2]] |> 
                   dplyr::select( members.sex, parent_index) |> 
                   dplyr::count( parent_index, members.sex) |> 
                   tidyr::spread( members.sex, n, fill = 0) |> 
                   dplyr::select( female)" )
```

]

---

class: inverse, center, middle

# TIME TO PRACTISE ON YOUR OWN!

### .large[.white[
] **10 minutes! **]
Open again your expanded xlsform, set up the outlier treatment, and clean the _"or_other"_ responses in the .large[clean] column. Then add calculated variables. Save and knit your report again!

Do not hesitate to raise your questions in the [ticket system](https://github.com/Edouard-Legoupil/kobocruncher/issues/new) or in the chat so that the training content can be improved accordingly!

---

class: inverse, center, middle

### .large[.white[
] **Let's take a break! **]
__Next session__: [07-Weighting](07-Weighting.html). If the data was created through a probabilistic sampling approach, we can apply weighting to the data and regenerate the report so that those weights are reflected.